Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support UTF-8 characters in feature name again #2976

Merged
merged 24 commits into from
Apr 10, 2020

Conversation

henry0312
Copy link
Contributor

@henry0312 henry0312 commented Apr 6, 2020

This commit reverts 0d59859.
Also see:

I reproduced the issue and as @kidotaka gave us a great survey in #2226,
I don't conclude that the cause is UTF-8, but "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (#2229)" whose commit hash
is 0d59859, and add support feature names as UTF-8 again.


Sample codes

reprodude the issue

The below code raises lightgbm.basic.LightGBMError: Wrong size of feature_names

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

train_x= numpy.random.normal(size=(1000, 4))
valid_x= numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)

train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)

# This has non-ascii strings but an empty string.
feature_names = ['F_零', 'F_一', 'F_二', '']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'seed': 42,
}

print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
                     train_lgb,
                     num_boost_round=20,
                     valid_sets=valid_lgb,
                     feature_name=feature_names,
                     evals_result=evals_result,
                     early_stopping_rounds=5)

print('Plotting feature importances...')
ax = lightgbm.plot_importance(gbm, ignore_zero=False)
pyplot.show()

use utf-8 characters

You can see there is no problem with using utf-8 in feature names.

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

train_x= numpy.random.normal(size=(1000, 4))
valid_x= numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)

train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)

# This has non-ascii strings.
feature_names = ['F_零', 'F_一', 'F_二', 'F_三']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'seed': 42,
}

print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
                     train_lgb,
                     num_boost_round=20,
                     valid_sets=valid_lgb,
                     feature_name=feature_names,
                     evals_result=evals_result,
                     early_stopping_rounds=5)

print('Plotting feature importances...')
ax = lightgbm.plot_importance(gbm, ignore_zero=False)
pyplot.show()

This commit reverts 0d59859.
Also see:
- microsoft#2226
- microsoft#2478
- microsoft#2229

I reproduced the issue and as @kidotaka gave us a great survey in microsoft#2226,
I don't conclude that the cause is UTF-8, but "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (microsoft#2229)" whose commit hash
is 0d59859, and add support feture names as UTF-8 again.
@guolinke
Copy link
Collaborator

guolinke commented Apr 6, 2020

@henry0312 did you test for this?

feature_names_ = Common::Split(key_vals["feature_names"].c_str(), ' ');

you can test that by save model to string/file, and load it back.

@henry0312
Copy link
Contributor Author

henry0312 commented Apr 6, 2020

@guolinke I just tried and confirm that there is no problem.

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

train_x= numpy.random.normal(size=(1000, 4))
valid_x= numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)

train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)

# This has non-ascii strings.
feature_names = ['F_零', 'F_一', 'F_二', 'F_三']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'seed': 42,
}

print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
                     train_lgb,
                     num_boost_round=20,
                     valid_sets=valid_lgb,
                     feature_name=feature_names,
                     evals_result=evals_result,
                     early_stopping_rounds=5)

# feature names
print('Feature names:', gbm.feature_name())

print('Saving model...')
# save model to file
gbm.save_model('model.txt')

print('Loading model to predict...')
# load model to predict
gbm2 = lightgbm.Booster(model_file='model.txt')

# feature names
print('Feature names:', gbm2.feature_name())
❯ python test_utf8.py
Starting training...
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000196 seconds.
You can set `force_col_wise=true` to remove the overhead.
[1]	valid_0's l1: 0.234781	valid_0's l2: 0.0734969
Training until validation scores don't improve for 5 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2]	valid_0's l1: 0.235442	valid_0's l2: 0.0736321
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3]	valid_0's l1: 0.236002	valid_0's l2: 0.0738012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4]	valid_0's l1: 0.236452	valid_0's l2: 0.0738955
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5]	valid_0's l1: 0.236552	valid_0's l2: 0.0739692
[6]	valid_0's l1: 0.236756	valid_0's l2: 0.0739201
Early stopping, best iteration is:
[1]	valid_0's l1: 0.234781	valid_0's l2: 0.0734969
Feature names: ['F_零', 'F_一', 'F_二', 'F_三']
Saving model...
Loading model to predict...
Feature names: ['F_零', 'F_一', 'F_二', 'F_三']

EDIT: 2020-04-06 19:07

@guolinke
Copy link
Collaborator

guolinke commented Apr 6, 2020

great! I think we can add more test cases for utf8 feature names.

@henry0312
Copy link
Contributor Author

@guolinke Sure! Please wait a moment.

@henry0312 henry0312 requested a review from wxchan as a code owner April 6, 2020 13:01
@StrikerRUS
Copy link
Collaborator

I think we should revisit R-package as well to ensure it supports non-ASCII names.
Ping @jameslamb and @Laurae2 for this.

@henry0312 henry0312 removed the request for review from wxchan April 6, 2020 13:33
@henry0312
Copy link
Contributor Author

I think we should revisit R-package as well to ensure it supports non-ASCII names.

As I remember, we hadn't had any problems with feature names before 0d59859.

@henry0312
Copy link
Contributor Author

henry0312 commented Apr 6, 2020

By the way, we must drop support of Python 2.x as soon as possible.
It is really evil for string as UTF-8.

@jameslamb
Copy link
Collaborator

I think we should revisit R-package as well to ensure it supports non-ASCII names.

As I remember, we hadn't had any problems with feature names before 0d59859.

@henry0312 I'm surprised by the R test errors you got, but they don't seem related to this PR and the fixes are ok with me generally (I'll leave a comment)

I'll add some test code for R here in a few minutes

@jameslamb
Copy link
Collaborator

I'm looking at the logs and really don't understand these errors, or what is different between the CI environment and my laptop. I'm running the tests from a shell so it's not like my local environment is benefitting from weird RStudio magic with the environment.

The new test is failing

image

But so are 8 others

image

@henry0312
Copy link
Contributor Author

F_<e9><9b><b6> is encoded by UTF-8.
F_\u96f6 is unicode.
https://www.compart.com/en/unicode/U+96F6
so you need to check the actual character (零).

@henry0312
Copy link
Contributor Author

json library of R may encode strings, although I don't know much about R.

@henry0312
Copy link
Contributor Author

I think functions around https://rlang.r-lib.org/reference/chr_unserialise_unicode.html are related.

@StrikerRUS
Copy link
Collaborator

If it is not trivial to enable UTF-8 support in R, I think we can split Python and R changes in separate PRs.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Apr 6, 2020

I found the same TeX issue, https://tug.org/pipermail/tex-live/2020-April/045230.html.
It seems a temporary problem 😓

Do we need to report the current issue somewhere, or they are aware as it is "every year" problem?

@henry0312
Copy link
Contributor Author

henry0312 commented Apr 7, 2020

Finally, all tests have passed (but include some work-arounds, which are not related to Python😅)
@guolinke @StrikerRUS @jameslamb this is ready for merge.

I found the same TeX issue, https://tug.org/pipermail/tex-live/2020-April/045230.html.
It seems a temporary problem 😓

Do we need to report the current issue somewhere, or they are aware as it is "every year" problem?

No need, but we will continue to check the thread.

@StrikerRUS
Copy link
Collaborator

@henry0312 I extracted the workarounds not related to this PR in #2977 and just merged it into master.

@henry0312 henry0312 self-assigned this Apr 7, 2020
@henry0312
Copy link
Contributor Author

Followed the current master/HEAD

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the following decodes?

ret = json.loads(string_buffer.value.decode())

self.__name_inner_eval = \
[string_buffers[i].value.decode() for i in range_(self.__num_inner_eval)]

tests/python_package_test/test_engine.py Outdated Show resolved Hide resolved
@jameslamb jameslamb self-requested a review April 7, 2020 20:57
@jameslamb
Copy link
Collaborator

Looks ok to me from R side! But my approval shouldn't count towards a merge.

@henry0312
Copy link
Contributor Author

What about the following decodes?

ret = json.loads(string_buffer.value.decode())

self.__name_inner_eval = \
[string_buffers[i].value.decode() for i in range_(self.__num_inner_eval)]

@StrikerRUS yeah, your points are right. .decode('utf-8') should be needed.

Looks ok to me from R side! But my approval shouldn't count towards a merge.

@jameslamb can you approve again?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one minor comment!

tests/python_package_test/test_engine.py Outdated Show resolved Hide resolved
@henry0312 henry0312 merged commit 44a9120 into microsoft:master Apr 10, 2020
@henry0312 henry0312 deleted the support_utf-8 branch April 10, 2020 03:54
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants